ABSTRACT

Techniques for unsupervised discovery of acoustic patterns are becoming
increasingly attractive, because huge quantities of speech data are
becoming available while manual annotations remain hard to acquire. In
this paper, we propose an approach for unsupervised discovery of
linguistic structure for the target spoken language given raw speech
data. This linguistic structure includes two-level (subword-like and
word-like) acoustic patterns, the lexicon of word-like patterns in terms
of subword-like patterns, and the N-gram language model based on
word-like patterns. All patterns, models, and parameters can be
automatically learned from the unlabelled speech corpus. This is
achieved by an initialization step followed by three cascaded stages for
acoustic, linguistic, and lexical iterative optimization. The lexicon of
word-like patterns defines the allowed consecutive sequences of HMMs for
subword-like patterns. In each iteration, model training and decoding
produce updated labels from which the lexicon and HMMs can be further
updated. In this way, model parameters and decoded labels are
respectively optimized in each iteration, and the knowledge about the
linguistic structure is learned gradually layer after layer. The
proposed approach was tested in preliminary experiments on a corpus of
Mandarin broadcast news, including a task of spoken term detection with
performance compared to a parallel test using models trained in a
supervised way. Results show that the proposed system not only yields
reasonable performance on its own, but is also complementary to existing
large vocabulary ASR systems.

Index Terms: unsupervised learning, hidden Markov models, spoken term
detection, zero resource speech recognition, iterative optimization


1. INTRODUCTION 

Supervised training of HMMs for automatic speech recognition relies not
only on collecting huge quantities of acoustic data, but also on
obtaining the corresponding precise labels. Such supervised training
yields adequate performance in most circumstances but at high cost, and
in many situations such annotated data sets are simply not available.
This is why substantial effort has been made toward unsupervised
discovery of acoustic patterns from huge quantities of acoustic data,
which are easily obtained nowadays, without manual labels or the
corresponding knowledge. Most of this effort discovered only one level
of phoneme-like acoustic patterns. However, it is well known that speech
signals have a multi-level structure including at least phonemes and
words, and such structure is very helpful in analysing or decoding
speech.


In this paper we propose an approach for unsupervised discovery of
structured two-level acoustic patterns including subword-like patterns
and word-like patterns (concatenations of several subword-like
patterns). Not only can the HMMs for these patterns, the number of
subword-like patterns and the lexicon size of word-like patterns be
automatically learned from data, but further knowledge about the
language, such as the N-gram language model and the word-like pattern
lexicon, jointly referred to as the linguistic structure in this paper,
can also be obtained directly from the acoustic signals of a corpus.
This is achieved by integrating a dynamic lexicon into the process of
conventional supervised HMM training, and performing three stages of
iterative optimization between the labels and the models, such that the
models, parameters, and linguistic structure collect knowledge from the
corpus layer after layer and adjust themselves accordingly. In this way,
we are able to develop semantic building blocks of the target spoken
language, represented by the word-like patterns, and acoustic building
blocks of the target spoken language, represented by the subword-like
patterns.

2. PROPOSED APPROACH: CASCADED THREE STAGES
OF ITERATIVE OPTIMIZATION

The goal is to find the parameter set $\theta = \{\theta^a, \theta^x,
\theta^l\}$ for the linguistic structure and the word-like pattern
labels $W$ given the observed acoustic feature vector sequences $O$ for
the corpus considered. The parameter set $\theta$ includes three parts:
$\theta^a$ for the acoustic HMMs of subword-like patterns, $\theta^x$
for the lexicon of word-like patterns in terms of subword-like pattern
sequences, and $\theta^l$ for the N-gram language model of word-like
patterns. This is achieved by first finding an initial label $W_0$ for
the observation $O$ as in (1). In each iteration $i$, we train the
parameters $\theta_i$ with the label $W_{i-1}$ obtained in the previous
iteration as in (2) and decode the label $W_i$ with the obtained
parameters $\theta_i$ as in (3).

$$W_0 = \mathrm{initialization}(O), \tag{1}$$
The iterations above are organized as an initialization step followed
by three cascaded stages (I), (II) and (III), respectively for acoustic,
linguistic and lexical optimization, as shown in Fig. 1. In Fig. 1, the
numbers of iterations for the three stages are $I_a$, $I_l$ and $I_x$
respectively. When the difference between $W_{i-1}$ and $W_i$ becomes
insignificant, the process advances to the next stage. The parameters
$\theta^a_i$ are generated by EM training as in (2), while the other
parameters $\theta^l_i$ and $\theta^x_i$ are generated directly from the
labels $W_{i-1}$ obtained in the previous iteration. However, not all of
$\theta^a_i$, $\theta^l_i$ and $\theta^x_i$ are used in each stage.

The detailed updating procedure is depicted in Fig. 2 and will be
explained shortly.
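
The alternation between training and decoding, with a stage change once
the labels stop changing, can be sketched as a generic loop. This is
only a minimal illustration under stated assumptions: `train`, `decode`
and `label_change` are hypothetical names standing in for HMM training,
free-word decoding and a label-consistency measure, and the tolerance
`tol` is an assumed stopping threshold that the text does not specify.

```python
from typing import Callable, Dict, List, Sequence

Labels = List[List[str]]  # one list of word-like pattern IDs per utterance


def label_change(prev: Labels, curr: Labels) -> float:
    """Fraction of utterances whose label sequence changed (assumed measure)."""
    changed = sum(1 for p, c in zip(prev, curr) if p != c)
    return changed / max(len(prev), 1)


def run_cascade(
    w0: Labels,
    stages: Sequence[str],
    train: Callable[[str, Labels], Dict],
    decode: Callable[[str, Dict], Labels],
    max_iters: Dict[str, int],
    tol: float = 0.01,  # assumed threshold for an "insignificant" difference
) -> Labels:
    """Run each stage until the labels stop changing or its budget is used."""
    labels = w0
    for stage in stages:
        for _ in range(max_iters[stage]):
            params = train(stage, labels)       # parameters from the previous labels
            new_labels = decode(stage, params)  # new labels from those parameters
            converged = label_change(labels, new_labels) <= tol
            labels = new_labels
            if converged:
                break  # advance to the next stage
    return labels
```

Passing the stage name to `train` and `decode` is one simple way to let
each stage use a different subset of the parameters.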

The basic idea behind the procedure in Fig. 1 is to gradually construct
and update the parameters layer after layer. This prevents the
parameters from being trapped in local optima, which often happens when
too many parameters are optimized at once. First, the HMM parameters for
the subword-like patterns are trained alone in stage (I), because these
HMMs are the primary building blocks of the whole linguistic structure
and reliable estimates of their parameters are the key. With reliable
enough HMMs for subword-like patterns, we then use N-gram parameters for
word-like patterns in stage (II) to better decode those word-like
patterns that frequently appear together, while continuously updating
the HMM parameters. Finally, in stage (III), we break the word-like
patterns into subword-like patterns and reconstruct better word-like
patterns. The number of word-like patterns in the lexicon may shrink
during the iterations of the first two stages because some less frequent
patterns can be absorbed by other patterns, but this number can change
significantly in the third stage. The time alignment of the subword-like
patterns is updated in every iteration when the labels $W_i$ are
decoded.
Fig. 2. Detailed diagrams for the three stages of (a) acoustic,
(b) linguistic and (c) lexical optimization.


Fig. 1. Simplified diagram of the four-phase iterative optimization
procedure: the proposed initialization step followed by Stage (I)
acoustic optimization, Stage (II) linguistic optimization and Stage
(III) lexical optimization. Some dependency links have been omitted.
Legend: $O$, observations; $W$, word-like pattern labels; $i$, iteration
index; $\theta^a$, HMMs for subword-like patterns; $\theta^l$, word-like
pattern N-gram language model; $\theta^x$, word-like pattern lexicon.


2.1. Initialization Step

Here we initialize the labels in a top-down fashion by first breaking
each utterance into word-like segments based on the discontinuities in
a parameter evaluated from energy and MFCC features. Each word-like
segment is then further divided into subword-like segments in the
following way. We perform a watershed transform on the filtered
self-similarity dotplot of the acoustic features of each hypothesized
word-like segment. The watershed transform is able to capture the number
of objects and their borders in a gray-scale image. The intersections of
the diagonal of the dotplot with the object borders of the watershed
transform are therefore taken as the boundaries between subword-like
segments. An example dotplot and its watershed transform, including the
hypothesized subword-like segment boundaries, are shown in Fig. 3.

Fig. 3. An example dotplot and its watershed transform.

We then extract an average representative feature vector for every
hypothesized subword-like segment, and perform global k-means clustering
on these representative vectors obtained from the whole corpus. The
number of clusters (the initial number of subword-like patterns) is
determined by the ratio of the within-cluster total scattering to the
between-cluster total scattering. A subword-like pattern ID is then
assigned to each cluster. A distinct sequence of consecutive
subword-like patterns within a word-like segment then defines a
word-like pattern, and the total number of distinct word-like patterns
in the corpus is the initial vocabulary size of the lexicon. The corpus
is thus represented by its initial labels $W_0$.

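
Two quantities used in the initialization can be illustrated with a
short sketch: the ratio of within-cluster to between-cluster scatter for
a given clustering, and the initial word-like labels with their
vocabulary size. The function names are hypothetical, squared Euclidean
distance is an assumption, and the rule that turns the ratio into a
final number of clusters is not given in the text, so it is not
modelled here.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def mean(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed scatter measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def scatter_ratio(vectors: List[Vector], ids: List[int]) -> float:
    """Within-cluster total scatter divided by between-cluster total scatter.

    Raises ZeroDivisionError when all cluster centres coincide.
    """
    clusters: Dict[int, List[Vector]] = {}
    for v, k in zip(vectors, ids):
        clusters.setdefault(k, []).append(v)
    overall = mean(vectors)
    within = between = 0.0
    for members in clusters.values():
        centre = mean(members)
        within += sum(sq_dist(v, centre) for v in members)
        between += len(members) * sq_dist(centre, overall)
    return within / between


def initial_labels(
    segments: List[List[int]],
) -> Tuple[List[Tuple[int, ...]], int]:
    """Turn each word-like segment (a list of subword-like pattern IDs) into
    a word-like pattern; return the labels and the initial vocabulary size."""
    labels = [tuple(seg) for seg in segments]
    return labels, len(set(labels))
```

For instance, three word-like segments with subword IDs `[1, 2]`, `[3]`
and `[1, 2]` give two distinct word-like patterns.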

2.2. Stage (I): Acoustic Optimization

The process in stage (I) is shown in Fig. 2(a). In each iteration, the
acoustic model set $\theta^a_i$ consists of the HMMs trained from the
corpus based on $W_i$ with the ML criterion. The lexicon $\theta^x_i$ is
derived by collecting all word-like patterns appearing in $W_i$ with
counts exceeding a threshold. Free-word decoding is then performed on
the whole corpus $O$ based on $\theta^a_i$ and $\theta^x_i$, producing
an updated label $W_{i+1}$. When $W_i$ is updated to $W_{i+1}$, not only
are the HMM parameters in $\theta^a_i$ and the HMM segmentation
boundaries updated, but the vocabulary size of $\theta^x_i$ may also
shrink when the counts of some word-like patterns become small enough.
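
The count-based lexicon derivation can be sketched as follows. This is
a minimal illustration: the name `derive_lexicon` and the value of
`min_count` are assumptions, since the text does not state the
threshold.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Pattern = Tuple[int, ...]  # a word-like pattern as a sequence of subword IDs


def derive_lexicon(labels: Iterable[List[Pattern]], min_count: int) -> Set[Pattern]:
    """Keep the word-like patterns whose corpus count exceeds min_count."""
    counts = Counter(p for utterance in labels for p in utterance)
    return {p for p, c in counts.items() if c > min_count}
```

Patterns that fall below the threshold after a decoding pass simply
drop out of the returned set, which is how the vocabulary can shrink
from one iteration to the next.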

2.3. Stage (II): Linguistic Optimization

This stage is shown in Fig. 2(b) and is very similar to the previous
stage. The only difference is that an N-gram language model $\theta^l_i$
for the word-like patterns is estimated from the label $W_i$ and is used
in decoding to produce the updated labels $W_{i+1}$. The N-grams help
produce better labels $W_{i+1}$, especially for word-like patterns that
frequently appear together.
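
Estimating an N-gram model from the current labels can be sketched for
the bigram case. The choice of N = 2, the absence of smoothing and the
boundary symbols `BOS` and `EOS` are assumptions made for illustration;
the text does not specify the order or the smoothing scheme.

```python
from collections import Counter
from typing import Dict, List, Tuple

BOS, EOS = "<s>", "</s>"  # hypothetical utterance boundary symbols


def estimate_bigram(labels: List[List[str]]) -> Dict[Tuple[str, str], float]:
    """Maximum-likelihood bigram probabilities P(w2 | w1) from label sequences."""
    history_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for utterance in labels:
        tokens = [BOS] + utterance + [EOS]
        for w1, w2 in zip(tokens, tokens[1:]):
            history_counts[w1] += 1
            pair_counts[(w1, w2)] += 1
    return {pair: c / history_counts[pair[0]] for pair, c in pair_counts.items()}
```

Pairs of word-like patterns that often occur together receive high
conditional probability and are therefore favoured during decoding.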

2.4. Stage (III): Lexical Optimization

We reconstruct new word-like patterns in this stage as in Fig. 2(c).
This is done by breaking the current word-like patterns into
subword-like patterns, and then reconstructing new word-like patterns
based on $W_i$. Those segments of several consecutive subword-like
patterns that appear frequently enough and with high enough right and
left context variation are taken as word-like patterns. This can be
achieved by constructing an efficient data structure called a PAT-tree
from the labels $W_i$. In this way, the lexicon $\theta^x_i$ can be
updated significantly in each iteration. This updated lexicon
$\theta^x_i$ is then used in free-word decoding to produce the labels
$W_{i+1}$. The whole process is completed when there is no significant
difference between $W_i$ and $W_{i+1}$. This gives the automatically
discovered linguistic structure $\theta = \{\theta^a, \theta^x,
\theta^l\}$, where $\theta^l$ is trained from the final version of
$W_{i+1}$.
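
The selection criterion (frequent enough, with enough left and right
context variation) can be sketched with plain dictionary counting in
place of a PAT-tree. The substitution is deliberate: the sketch is
simpler but less efficient. Counting the number of distinct neighbouring
patterns as the measure of context variation, and all three thresholds,
are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Seq = Tuple[int, ...]


def candidate_patterns(
    utterances: List[List[int]],
    max_len: int,
    min_count: int,
    min_contexts: int,
) -> Set[Seq]:
    """Select subword sequences that are frequent and occur in varied contexts.

    Subword IDs are assumed to be non-negative so that -1 can mark an
    utterance boundary as a context of its own.
    """
    count: Dict[Seq, int] = defaultdict(int)
    left: Dict[Seq, Set[int]] = defaultdict(set)
    right: Dict[Seq, Set[int]] = defaultdict(set)
    for utt in utterances:
        for n in range(2, max_len + 1):
            for start in range(len(utt) - n + 1):
                end = start + n
                seq = tuple(utt[start:end])
                count[seq] += 1
                left[seq].add(utt[start - 1] if start > 0 else -1)
                right[seq].add(utt[end] if end < len(utt) else -1)
    return {
        s
        for s, c in count.items()
        if c >= min_count
        and len(left[s]) >= min_contexts
        and len(right[s]) >= min_contexts
    }
```

A sequence that always appears inside the same longer sequence has a
single left or right neighbour and is rejected, which is the intuition
behind requiring context variation.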

3. EXPERIMENTS 

3.1. Experimental Setup 

The proposed approach was tested in preliminary experiments performed
on a corpus of Mandarin broadcast news collected in Taiwan in 2001, with
a length of 4 hours and including 5034 utterances. The HMM used for each
subword-like pattern had 13 states, each with only 1 Gaussian component.
This configuration was selected under the assumption that the
subword-like patterns of interest should describe more signal trajectory
variation and less acoustic variation. Signal segments with larger
acoustic variation should be classified as different patterns. The final
linguistic structure, including all patterns, models and parameters, was
obtained by performing 30 iterations in each of the stages (I), (II) and
(III) in Fig. 1 on the entire corpus.

3.2. Initial Observations and Analysis 

It is interesting that almost all of the 208 subword-like patterns
obtained here roughly correspond to Mandarin syllables (each Chinese
character is pronounced as a Mandarin syllable). A global view of the
mapping relation from the 208 subword-like patterns to the total of 399
Mandarin syllables manually labelled for the corpus is shown in Fig. 4.
The Mandarin syllables on the horizontal scale of the figure have been
sorted according to acoustic similarity (only a quarter of them are
explicitly printed due to limited space). Every circle represents 35 or
more occurrences of a subword-like pattern on the vertical scale whose
central feature frame belonged to the Mandarin syllable on the
horizontal scale. This figure implies a very close to one-to-one mapping
relation, with some fuzziness around neighbouring syllables with similar
acoustic behaviour. The 362 word-like patterns obtained corresponded to
roughly 154 frequently occurring multi-syllable words and 208
monosyllables (or mono-subword-like patterns). Words that did not occur
frequently enough could not be discovered and as a result were
represented as one to several mono-subword-like patterns.

Fig. 5 further illustrates how the number of subword-like patterns, the
lexicon size of word-like patterns, and the consistency between
$W_{i-1}$ and $W_i$ at the word-like pattern level and the utterance
level changed over the iterations. From a global perspective, the
lexicon size of word-like patterns dropped in stages (I) and (II), and
then jumped and oscillated in stage (III). Although most word-like
patterns in stage (I) did not survive to the end of stage (II), their
main purpose was to provide some context guidance for the training of
the subword-like HMMs.



Fig. 4. Mapping relation between the discovered subword-like patterns
and Mandarin syllables. Only pairs with 35 or more occurrences are
shown, and the average co-occurrence count over all circles in the
figure is 331.


3.3. Justification of the Initialization and Iterative Stages 

We performed further tests with configurations slightly different from
the proposed approach on a subset of 942 utterances out of the 5034 in
the tested corpus. For each configuration considered, we evaluated the
syllable accuracy by mapping every discovered subword-like pattern to a
corresponding Mandarin syllable (as was done in Fig. 4). In the first
part, we initialized $W_0$ with 3 different methods and then applied 50
iterations of stage (I) only. The three methods are (1) the proposed
two-level top-down labelling starting with word-like segments, (2)
subword initialization with only the watershed transform, without
higher-level word-like segments, and (3) the same as (2) but without
k-means clustering, with the same number of subword-like pattern IDs
randomly assigned to the subword-like segments. The main difference
between methods (1) and (2) was the two-level pattern structure. Method
(1) brought us halfway through the proposed approach (initialization and
stage (I)), producing two-level patterns, while method (2) was similar
to the unsupervised initialization methods used previously, with
one-level patterns only. The results are in the left half of Table 1.
Although method (1) was only 1.03% (absolute) better than method (2),
manual auditing tests of the patterns obtained with method (1) suggest
that the improvement is non-trivial. This verified that the word-like
pattern constraints were useful in the acoustic optimization process.
The random ID assignments without clustering in method (3) also offered
relatively high accuracy. This implied that the acoustic optimization
iterations in stage (I) were quite helpful.
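
The evaluation by most probable assignment can be sketched as follows.
The sketch assumes that each discovered subword-like pattern has already
been paired with one reference syllable (for example through the central
frame, as in Fig. 4); the function names are hypothetical, and the exact
scoring used for Table 1 may differ, for instance by using an alignment
with insertions and deletions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def most_probable_mapping(pairs: List[Tuple[int, str]]) -> Dict[int, str]:
    """Map each pattern ID to the syllable it co-occurs with most often."""
    table: Dict[int, Counter] = defaultdict(Counter)
    for pattern, syllable in pairs:
        table[pattern][syllable] += 1
    return {p: c.most_common(1)[0][0] for p, c in table.items()}


def mapped_accuracy(pairs: List[Tuple[int, str]]) -> float:
    """Fraction of paired units whose mapped syllable matches the reference."""
    mapping = most_probable_mapping(pairs)
    hits = sum(1 for p, s in pairs if mapping[p] == s)
    return hits / len(pairs)
```

Because several patterns may map to the same syllable and some syllables
may receive no pattern at all, this kind of score is bounded by how
close the mapping is to one-to-one.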

In the second part, we initialized $W_0$ with the two-level method and
then applied 3 different iteration sequences: (1) $(I_a, I_l, I_x) =
(30, 20, 0)$, (2) $(I_a, I_l, I_x) = (50, 0, 0)$, and (3) $(I_a, I_l,
I_x) = (0, 50, 0)$. Method (1) brought us halfway through the proposed
approach, while method (3) was actually the intuitive joint optimization
of both acoustic and linguistic parameters, similar to previously
proposed approaches. The results are in the right half of Table 1. The
proposed method (1) was 2.37% better than the joint optimization method
(3). The proposed method (1) was also better than applying method (2)
alone, which implies that the transition between stages was the source
of the improvement. This verified that gradually learning layer after
layer yielded more reliable results. The benefits of the lexical
optimization in stage (III), on the other hand, are better observed in a
companion paper on semantic retrieval of spoken content also submitted
to ICASSP 2013, since the word-like patterns carry semantics.

Fig. 5. Number of subword-like patterns and lexicon size for word-like
patterns (left), and consistency between $W_i$ and $W_{i-1}$ in terms of
word-like patterns and utterances (right), as functions of iterations.
The transitions from stage (I) to stage (II) and from stage (II) to
stage (III) happened at iterations 30 and 60 respectively.

(A) Initialization methods   Accuracy   (B) Iteration methods $(I_a, I_l, I_x)$   Accuracy
(1) Two-level                38.96%     (1) (30, 20, 0)                           39.45%
(2) One-level                37.93%     (2) (50, 0, 0)                            38.96%
(3) Random                   35.76%     (3) (0, 50, 0)                            37.08%

Table 1. ASR accuracy of unsupervised transcription translated by
string replacement with most probable assignment.

3.4. Spoken Term Detection 

We also applied the discovered patterns to a task of spoken term
detection and compared them with a set of Mandarin syllable models
trained on a manually annotated corpus of 24.5 hours of Mandarin
broadcast news, with a trigram language model over a 72k vocabulary used
in recognition. The performance of the supervised HMMs serves as an
upper bound for the performance of our unsupervised HMMs. We tested the
supervised and unsupervised models under the same scenario. The query
set consisted of 52 named entities of countries, organizations and
political leaders. For each query, we decoded its corresponding
utterances in the corpus and selected the most frequent HMM sequence to
represent the query (equivalent to query by one example using the best
query utterance). Syllable HMMs were used for the supervised case, and
subword-like pattern HMMs were used for the unsupervised case. This
query HMM sequence was then compared with the HMM sequences of all
utterances in the corpus to evaluate the relevance scores for retrieval.
We first computed offline the distance between each pair of HMMs. The
distance between two HMMs was defined as the DTW distance between their
two state sequences, so one state in an HMM can be matched with several
states in the other HMM and vice versa. The local distance used in DTW
was the KL divergence between the Gaussian mixtures of the two states.
We then calculated the distance between the query HMM sequence and the
corpus HMM sequences online. The distance between two HMM sequences was
defined as the sum of the distances for the matched pairs of models in
the two sequences. Since most of the computation was done offline, this
method was as fast as text information retrieval.
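
The offline HMM-to-HMM distance can be sketched as DTW over state
sequences with a KL-based local cost. The sketch assumes one
diagonal-covariance Gaussian per state (as in the 13-state single
Gaussian configuration above) and symmetrises the divergence; both are
assumptions, since KL divergence between general Gaussian mixtures has
no closed form and the text does not say how it was computed.

```python
import math
from typing import List, Sequence, Tuple

Gaussian = Tuple[Sequence[float], Sequence[float]]  # (means, variances)


def kl_diag(p: Gaussian, q: Gaussian) -> float:
    """KL(p || q) for two diagonal-covariance Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(p[0], p[1], q[0], q[1]):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total


def sym_kl(p: Gaussian, q: Gaussian) -> float:
    """Symmetrised KL divergence, used here as the local DTW cost."""
    return kl_diag(p, q) + kl_diag(q, p)


def dtw_distance(a: List[Gaussian], b: List[Gaussian]) -> float:
    """DTW distance between two HMM state sequences.

    One state may be matched with several states of the other sequence.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = sym_kl(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]
```

Since the set of HMMs is fixed after training, the full table of
pairwise distances can be filled once and looked up at query time.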


Fig. 6. Retrieval performance of the combined distance: spoken term
detection performance based on the weighted sum of the unsupervised
(left) and supervised (right) distance metrics.

We took the weighted sum of the supervised distance $d_s$ and the
unsupervised distance $d_u$, and performed spoken term detection based
on the combined distance $d_\lambda = \lambda \times d_s + (1 - \lambda)
\times d_u$. The results in Fig. 6 show that reasonable detection
performance was achieved by the unsupervised model on its own ($\lambda
= 0$). More importantly, the combined distance can yield better results
in all three measures than using only the supervised or the unsupervised
distance. This implies that the proposed method has successfully
harvested information directly from the data that was lost during
recognition with the supervised models. In other words, the proposed
method not only performs reasonably well on its own, but is also
complementary to standard supervised ASR systems.
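
The interpolation of the two distances and the resulting ranking can be
sketched as follows. The function names and the dictionary-based
interface (utterance ID to distance) are hypothetical choices for
illustration only.

```python
from typing import Dict, List


def combined_distance(d_s: float, d_u: float, lam: float) -> float:
    """d_lambda = lam * d_s + (1 - lam) * d_u."""
    return lam * d_s + (1.0 - lam) * d_u


def rank_utterances(
    sup: Dict[str, float], unsup: Dict[str, float], lam: float
) -> List[str]:
    """Order utterance IDs from the smallest combined distance to the largest."""
    scores = {u: combined_distance(sup[u], unsup[u], lam) for u in sup}
    return sorted(scores, key=scores.get)
```

Setting `lam` to 0 ranks by the unsupervised distance alone and setting
it to 1 ranks by the supervised distance alone, which correspond to the
two ends of the horizontal axis in Fig. 6.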

4. CONCLUSION 

This work presents an approach for unsupervised discovery of linguistic
structure including two-level acoustic patterns from a corpus. The main
difference from similar approaches proposed earlier lies in the
two-level acoustic patterns and the layer-after-layer gradual learning
of the model parameters with cascaded stages of iterative optimization.
Although some earlier approaches also took hierarchical knowledge into
consideration, our work used 13-state single-Gaussian HMMs, as compared
to the conventional HMMs with a smaller number of states and multiple
Gaussians, to model the trajectories of acoustic patterns with less
acoustic variation. The preliminary experiment on spoken term detection
with subword-like pattern sequences indicated that the proposed system
is complementary to existing ASR systems. A more complete experiment on
spoken term detection in a companion paper submitted to ICASSP 2013
demonstrates how our model can outperform the segmental DTW approach.
Also, the second level of word-like patterns is aimed at capturing some
semantic features in the acoustic signal, which can be verified in a
companion paper on semantic retrieval of spoken content also submitted
to ICASSP 2013.